Visualising data

The data under study consists in counts of reads for the Thyroid carcinoma as a Summarized Experiment object @ref(import). This dataset comes from the “Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results” project, with 20115 analyzed genes and 572 samples.

The column data contains pehnotypic data, which corresponds to clinical variables and their corresponding metadata @ref(coldata), it contains 549 columns with different information about the samples. The metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.

The row data contains feature data @ref(rowdata). We can observe the characteristics of the studied genes such as their location, ranges, strand, symbol, length and CG content.

library(SummarizedExperiment)

thca <- readRDS("../seTHCA.rds")
thca
class: RangedSummarizedExperiment 
dim: 20115 572 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
  TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
  TCGA.KS.A41L.11A.11R.A23N.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
dim(colData(thca))
[1] 572 549
colData(thca)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.4C.A93U.01A.11R.A39I.07    tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
TCGA.BJ.A0YZ.01A.11R.A10U.07    tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
TCGA.BJ.A0Z0.01A.11R.A10U.07    tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
TCGA.BJ.A0Z2.01A.11R.A10U.07    tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
TCGA.BJ.A0Z3.01A.11R.A13Y.07    tumor 331cae6e-2868-4c58-9302-709a9ff7d025
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.4C.A93U.01A.11R.A39I.07        TCGA-4C-A93U           2014-11-12
TCGA.BJ.A0YZ.01A.11R.A10U.07        TCGA-BJ-A0YZ            2011-4-11
TCGA.BJ.A0Z0.01A.11R.A10U.07        TCGA-BJ-A0Z0            2011-4-15
TCGA.BJ.A0Z2.01A.11R.A10U.07        TCGA-BJ-A0Z2            2011-4-15
TCGA.BJ.A0Z3.01A.11R.A13Y.07        TCGA-BJ-A0Z3             2011-6-2
                             prospective_collection
                                           <factor>
TCGA.4C.A93U.01A.11R.A39I.07                    YES
TCGA.BJ.A0YZ.01A.11R.A10U.07                     NO
TCGA.BJ.A0Z0.01A.11R.A10U.07                     NO
TCGA.BJ.A0Z2.01A.11R.A10U.07                     NO
TCGA.BJ.A0Z3.01A.11R.A13Y.07                     NO
mcols(colData(thca), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827
metadata(thca)
$experimentData
Experiment data
  Experimenter name: Piccolo SR 
  Laboratory: Dept. of Biology, Brigham Young University, Provo, UT (USA) 
  Contact information: stephen_piccolo@byu.edu 
  Title: Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results 
  URL: https://github.com/srp33/TCGA_RNASeq_Clinical 
  PMIDs: 26209429 
  No abstract available.

$annotation
[1] "org.Hs.eg.db"

$cancerTypeCode
[1] "THCA"

$cancerTypeDescription
[1] "Thyroid carcinoma"

$objectCreationDate
[1] "2016-04-25"
rowData(thca)
DataFrame with 20115 rows and 3 columns
           symbol     txlen      txgc
      <character> <integer> <numeric>
1            A1BG      3322 0.5644190
2             A2M      4844 0.4882329
3            NAT1      2280 0.3942982
4            NAT2      1322 0.3895613
5        SERPINA3      3067 0.5249429
...           ...       ...       ...
20111       POTEB      1706 0.4308324
20112    SNORD124       104 0.4903846
20113   SNORD121B        81 0.4074074
20114      GAGE10       538 0.5055762
20115   BRWD1-IT2      1028 0.5924125
rowRanges(thca)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames               ranges strand |      symbol     txlen
               <Rle>            <IRanges>  <Rle> | <character> <integer>
          1    chr19 [58345178, 58362751]      - |        A1BG      3322
          2    chr12 [ 9067664,  9116229]      - |         A2M      4844
          9     chr8 [18170477, 18223689]      + |        NAT1      2280
         10     chr8 [18391245, 18401218]      + |        NAT2      1322
         12    chr14 [94592058, 94624646]      + |    SERPINA3      3067
        ...      ...                  ...    ... .         ...       ...
  100996331    chr15 [20835372, 21877298]      - |       POTEB      1706
  101340251    chr17 [40027542, 40027645]      - |    SNORD124       104
  101340252     chr9 [33934296, 33934376]      - |   SNORD121B        81
  102724473     chrX [49303669, 49319844]      + |      GAGE10       538
  103091865    chr21 [39313935, 39314962]      + |   BRWD1-IT2      1028
                 txgc
            <numeric>
          1 0.5644190
          2 0.4882329
          9 0.3942982
         10 0.3895613
         12 0.5249429
        ...       ...
  100996331 0.4308324
  101340251 0.4903846
  101340252 0.4074074
  102724473 0.5055762
  103091865 0.5924125
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

In this supplementary material we will perform a quality assessment and normalization of the data, in order to do that, we will create a ‘DGEList’ object. In order to ease the manipulation of the data \(\log_2\) CPM values of expression are calculated @ref(logCPM).

library(edgeR)

dge <- DGEList(counts=assays(thca)$counts, genes=mcols(thca))
Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
Arguments in '...' ignored
saveRDS(dge, file.path("./", "dge.rds"))
assays(thca)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(thca)$logCPM[1:5, 1:5]
   TCGA.4C.A93U.01A.11R.A39I.07 TCGA.BJ.A0YZ.01A.11R.A10U.07
1                      2.408053                     3.875800
2                      9.483176                    10.203685
9                     -6.667872                    -6.667872
10                    -6.667872                    -6.667872
12                     3.688878                     4.988388
   TCGA.BJ.A0Z0.01A.11R.A10U.07 TCGA.BJ.A0Z2.01A.11R.A10U.07
1                      2.778138                     4.239719
2                      8.890882                     8.403990
9                     -6.667872                    -6.667872
10                    -6.667872                    -6.667872
12                     4.804717                     4.839157
   TCGA.BJ.A0Z3.01A.11R.A13Y.07
1                      2.655578
2                      9.190160
9                     -6.667872
10                    -6.667872
12                     4.605054

Quality assessment

Important Variables

In thyroid cancer we consider some variables more important than others as, according to bibliography, significant differences are seen in the categories of this variables related to thyroid cancer. We want to have a first look to this categories before the analysis of the whole data. This will give us an idea of the distribution of our data and can be useful in further analysis @ref(numvars).

par(mfrow=c(3, 1))
#Type
tumor <- thca[ , colData(thca)$type == "tumor"]
normal <- thca[ , colData(thca)$type == "normal"]
dim(tumor)
[1] 20115   513
dim(normal)
[1] 20115    59
barplot(summary(colData(thca)$type), main="Type of sample", xlab="Sample type", ylab="Total counts")
#Gender
male <- thca[ , !is.na(thca$gender) & colData(thca)$gender == "MALE"]
female <- thca[ , !is.na(thca$gender) & colData(thca)$gender == "FEMALE"]
dim(male)
[1] 20115   149
dim(female)
[1] 20115   406
na <- thca[ , is.na(thca$gender)]
dim(na)
[1] 20115    17
barplot(summary(colData(thca)$gender), main="Patient's gender", xlab="Gender", ylab="Total counts")
#Age
barplot(summary(colData(thca)$age_at_diagnosis),main="Patient's age", xlab="Age", ylab="Total counts")

The last “age” column correspond to NA, but we do not remove them in this plot because we may need those samples in further analysis. A we can see, we have 513 tumors and 59 normal samples; or, lookig at the gender, 149 males and 406 females, with 17 samples from unknown gender.

Sequencing depth

In this part, we examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure @ref(libsizes) below shows the sequencing depth per sample, also known as library sizes, in increasing order.

Fig1.Library sizes in increasing order.

Fig1.Library sizes in increasing order.

In this plot we can visually observe again that we have more samples from tumor than normal, both types have similar sequencing depth. However, the fugure also reveals substantial differences in sequencing depth between samples, this is why the discardation of those samples whose depth is substantially lower than the rest needs to be taked into consideration.

Sample Depth

In order to remove those samples with a lower sequencing depth value, we first calculate the sample depth @ref(smpdepth), and then select a cutoff to remove samples.

sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(thca), 6, 12)
sort(sampledepth)
DJ.A3VJ ET.A3BP ET.A3DO EL.A4K7 EL.A4K2 EL.A3CP EM.A22Q E8.A417 DJ.A13W 
    0.5     2.0     2.2     2.6     3.2    14.4    15.6    16.9    18.0 
EM.A22O KS.A41I EL.A3ZS ET.A25M ET.A25L BJ.A28T EM.A3FQ FK.A4UB FE.A3PD 
   19.2    20.5    20.7    20.9    21.7    22.3    22.9    23.0    23.1 
IM.A41Z DJ.A13R DJ.A1QG KS.A41F EM.A2CK BJ.A45E EM.A1YB ET.A4KN EM.A1CT 
   23.3    23.7    23.7    23.7    24.1    24.9    26.4    26.7    27.4 
EL.A4JX EL.A3ZL BJ.A0ZE KS.A4I7 BJ.A28W DJ.A2PY DJ.A4UR FY.A3R6 DE.A7U5 
   27.5    28.5    29.6    29.6    30.4    30.4    30.5    30.6    31.0 
EM.A1YE EL.A3ZG DJ.A2PQ BJ.A28W EM.A3O3 EM.A22M EM.A2CM DE.A4MA EL.A3H5 
   31.2    31.6    32.7    32.7    32.8    32.9    33.5    33.7    33.7 
BJ.A18Z QD.A8IV EM.A1CW KS.A4I1 FY.A3TY EL.A3ZQ MK.A84Z L6.A4EP DE.A4M8 
   34.0    34.1    34.3    34.4    34.8    34.9    34.9    35.3    35.4 
ET.A25R GE.A2C6 ET.A3DR FE.A230 DJ.A2QA EM.A4G1 FY.A76V KS.A4ID EM.A4FM 
   35.6    35.6    35.7    35.8    36.0    36.2    36.2    36.2    36.6 
BJ.A0Z9 EM.A3SZ ET.A25K EL.A3CW DE.A4MB DJ.A2Q0 FE.A22Z FE.A3PC KS.A4IB 
   36.9    36.9    36.9    37.0    37.1    37.4    37.5    37.7    37.7 
E8.A437 DJ.A4UT EL.A4K1 EM.A3ST E8.A2JQ BJ.A28S E8.A3X7 EM.A3O9 KS.A4I3 
   38.0    38.1    38.2    38.5    38.6    38.7    38.9    39.0    39.3 
BJ.A191 EM.A1YC BJ.A0Z2 BJ.A45F EM.A3SU ET.A40S E8.A44K EL.A3D5 DJ.A4V4 
   39.4    39.4    39.6    39.6    39.6    39.6    39.7    39.9    40.0 
FY.A3YR EM.A2CO EM.A4FV ET.A39M BJ.A2NA DJ.A2PZ BJ.A0Z5 BJ.A45H EM.A2OY 
   40.4    40.5    40.6    40.7    40.7    40.8    40.9    41.2    41.2 
ET.A3DU KS.A4I9 DJ.A3VL L6.A4EQ EL.A4K9 BJ.A2P4 EM.A3FN IM.A4EB FY.A3TY 
   41.2    41.2    41.3    41.5    41.6    41.7    41.7    41.8    41.8 
4C.A93U DJ.A13V E8.A433 EM.A3ST BJ.A3PT EM.A3SY KS.A41I BJ.A0ZJ ET.A3DQ 
   41.9    42.0    42.0    42.2    42.3    42.4    42.4    42.6    42.6 
DE.A4MD ET.A39R J8.A4HW KS.A4I5 EM.A2P1 KS.A4IC IM.A41Y KS.A41L FY.A3BL 
   42.7    42.8    42.9    42.9    43.0    43.0    43.1    43.1    43.2 
FY.A4B0 EL.A3D4 MK.A4N9 EL.A3CM DJ.A1QO J8.A4HY DJ.A3V0 EM.A2CP BJ.A28Z 
   43.3    43.6    43.6    43.8    44.0    44.0    44.1    44.1    44.2 
DE.A69K DJ.A3V6 EM.A2P1 BJ.A4O8 DE.A4M9 EL.A3N3 EL.A3ZN EM.A3AO BJ.A0YZ 
   44.4    44.4    44.4    44.5    44.5    44.6    44.6    44.6    44.7 
E8.A44M BJ.A3PR DE.A0XZ BJ.A3PR CE.A13K DJ.A3UW EL.A3GO EL.A3ZT EL.A4KI 
   44.7    44.8    44.8    44.8    44.9    44.9    44.9    44.9    44.9 
EM.A3AJ EM.A2OW EM.A1CU DJ.A2QC EL.A3CU BJ.A0ZC CE.A3MD DJ.A4UL BJ.A45I 
   44.9    45.0    45.0    45.1    45.1    45.2    45.2    45.2    45.3 
EL.A3CX E8.A2JQ EL.A3ZM DJ.A13M E3.A3E0 ET.A2N3 E8.A438 EL.A3CL J8.A3YH 
   45.3    45.3    45.4    45.5    45.5    45.7    45.9    45.9    45.9 
EM.A4FR EL.A3GZ DJ.A2PS EL.A3GR EM.A4FN EM.A3FP ET.A39T DJ.A3VM EL.A4K6 
   46.0    46.0    46.3    46.4    46.4    46.5    46.5    46.6    46.6 
EM.A2OV E8.A2EA EM.A1YD ET.A2N0 FY.A40N BJ.A291 EL.A4JV ET.A2N5 BJ.A45C 
   46.6    46.7    46.7    46.7    46.7    46.8    46.8    46.8    46.9 
ET.A3BW E3.A3E5 BJ.A28R EL.A3H8 EL.A3GP EL.A3T1 DO.A1K0 EL.A3H3 EM.A4FO 
   46.9    47.0    47.1    47.2    47.3    47.3    47.4    47.4    47.4 
FY.A3NN J8.A3YD EL.A3ZP CE.A485 EL.A3T3 DJ.A3VK ET.A2MX BJ.A0ZG BJ.A3PU 
   47.4    47.4    47.5    47.7    47.7    47.8    47.8    47.9    47.9 
EL.A3ZP ET.A3DT EL.A3ZK ET.A40P BJ.A0Z0 DE.A0Y3 EM.A2CR ET.A39N CE.A3ME 
   47.9    47.9    48.0    48.1    48.2    48.2    48.2    48.2    48.3 
DJ.A4UW EL.A3TA EL.A3ZS EM.A3AQ ET.A2MY EM.A1YC L6.A4ET EM.A2OX EM.A3AR 
   48.3    48.3    48.3    48.3    48.3    48.4    48.4    48.5    48.5 
DJ.A3VE EL.A3GU CE.A484 DJ.A2PT DO.A1JZ EL.A3ZK ET.A2MX E3.A3E2 EL.A3T8 
   48.6    48.6    48.7    48.7    48.7    48.9    48.9    49.0    49.0 
FY.A2QD DJ.A4UQ FY.A40K EL.A3ZO DJ.A4UP DJ.A4V0 E8.A419 EL.A3D0 EL.A3D6 
   49.0    49.1    49.1    49.1    49.2    49.2    49.2    49.2    49.2 
EL.A3D1 EL.A3GQ J8.A42S ET.A39O J8.A4HW BJ.A192 DJ.A2QB DJ.A3VD E3.A3DZ 
   49.3    49.3    49.3    49.4    49.4    49.5    49.6    49.6    49.6 
DJ.A2Q1 DJ.A3UU E8.A413 EL.A3T0 EL.A3CN EL.A3CS ET.A40T DE.A4MD EM.A2CN 
   49.7    49.7    49.7    49.7    49.8    49.8    49.8    49.9    49.9 
DO.A1JZ BJ.A28X EL.A4JZ EM.A1CV BJ.A0ZA DJ.A1QL DJ.A2PO EM.A22I CE.A481 
   50.0    50.1    50.1    50.1    50.2    50.3    50.3    50.3    50.4 
EL.A3MX EL.A4K0 EL.A4JW ET.A39K ET.A3BV EL.A4KH ET.A2N4 EM.A1CS BJ.A2N9 
   50.4    50.4    50.5    50.5    50.5    50.6    50.6    50.6    50.7 
BJ.A18Y DO.A2HM EM.A1CU BJ.A28X BJ.A45G DJ.A4V5 EM.A2P3 ET.A39P EM.A2P0 
   50.8    50.8    50.8    50.9    51.0    51.0    51.0    51.0    51.1 
J8.A3O2 ET.A3DW DJ.A3V3 EL.A3CO EM.A3SX BJ.A2N9 ET.A25N DJ.A4V2 EM.A2CJ 
   51.1    51.1    51.2    51.2    51.2    51.2    51.3    51.4    51.4 
BJ.A3F0 EM.A3AI FY.A3I5 DJ.A13U EL.A3CZ EL.A3GZ FE.A232 EL.A3T8 EL.A3ZR 
   51.5    51.5    51.5    51.6    51.7    51.8    51.8    51.8    51.9 
J8.A3YF EL.A3ZT DJ.A3UZ EM.A22P DE.A0Y2 EL.A3H2 EM.A4FK ET.A39S BJ.A45K 
   51.9    51.9    52.1    52.3    52.4    52.4    52.4    52.4    52.5 
DJ.A13L EL.A3CY FY.A3RA DJ.A3UK H2.A422 J8.A3O2 E3.A3E3 EL.A3T0 ET.A3BO 
   52.5    52.6    52.6    52.7    52.7    52.7    52.9    53.0    53.0 
J8.A3YE ET.A39J ET.A3DW MK.A4N6 DJ.A1QQ EM.A3FO ET.A25G FE.A3PA FY.A40M 
   53.0    53.1    53.1    53.1    53.2    53.2    53.2    53.2    53.2 
EL.A3GV EL.A3MW EL.A3H2 DJ.A3UR EM.A2CS FE.A235 EL.A3MW ET.A3BX BJ.A0ZH 
   53.3    53.3    53.4    53.6    53.6    53.6    53.6    53.7    53.8 
DJ.A3UN EL.A3MY EM.A3FQ FY.A4B4 IM.A420 DE.A3KN DJ.A2PW EL.A3T2 EL.A3CT 
   53.8    53.9    53.9    54.0    54.0    54.1    54.1    54.3    54.4 
EL.A3T6 EL.A3CR EM.A22J FE.A237 BJ.A45D BJ.A45J EM.A3FR FY.A40L ET.A3BN 
   54.4    54.5    54.5    54.5    54.6    54.6    54.6    54.6    54.7 
CE.A482 DJ.A3VI EM.A1YA EM.A2CS BJ.A190 EL.A3TB EL.A3T9 EM.A3FL KS.A41L 
   54.8    54.8    54.8    54.8    54.9    54.9    55.0    55.0    55.0 
EL.A4KD FY.A3ON J8.A3O0 DJ.A3VG EL.A3GX EM.A22K EM.A3O7 E8.A432 E8.A436 
   55.1    55.1    55.1    55.2    55.2    55.2    55.2    55.3    55.3 
J8.A3YH EL.A3H7 EL.A3ZR BJ.A290 DJ.A3UP EL.A3GS DJ.A2PR DJ.A2PV EL.A3N2 
   55.3    55.3    55.4    55.5    55.5    55.5    55.6    55.6    55.6 
KS.A41J DJ.A3UO EL.A3H4 DE.A4MC DJ.A3US FK.A3S3 DJ.A3UQ E8.A434 FE.A3PB 
   55.7    55.8    55.8    55.9    55.9    55.9    56.0    56.1    56.1 
FY.A4B3 DJ.A2PN H2.A421 IM.A3U2 EM.A1CW EM.A2CU DJ.A2Q9 EL.A3ZH EM.A4FQ 
   56.1    56.2    56.2    56.2    56.3    56.3    56.5    56.5    56.5 
EM.A1CT EL.A3CV ET.A3BS DJ.A3UY ET.A25J CE.A483 EL.A3MZ FE.A239 EL.A3MY 
   56.6    56.7    56.7    56.8    56.8    56.9    56.9    56.9    56.9 
DJ.A3UX E8.A416 EL.A3T7 ET.A25O ET.A3BU EM.A3FK ET.A3DS GE.A2C6 DJ.A2PX 
   57.0    57.0    57.0    57.2    57.2    57.3    57.3    57.3    57.4 
DJ.A1QE DJ.A3UT EL.A3H1 EM.A3FM EL.A3ZL DJ.A1QI EL.A4K4 BJ.A2NA DJ.A13S 
   57.5    57.5    57.5    57.7    57.7    57.8    57.9    58.0    58.3 
EM.A3OB BJ.A2N8 FY.A3R9 EL.A3T3 EL.A3GW IM.A3ED ET.A40R EM.A2P2 DJ.A13P 
   58.3    58.5    58.5    58.5    58.6    58.6    58.7    58.8    58.9 
DJ.A13X DJ.A3V5 BJ.A0ZB DJ.A2Q2 EL.A3ZG DJ.A3VA EL.A3N3 FY.A3R8 ET.A2N5 
   58.9    58.9    59.0    59.0    59.0    59.1    59.1    59.2    59.2 
DJ.A2PU FK.A3SH DJ.A3UV EM.A3SU EM.A4FU H2.A3RH J8.A3NZ DJ.A3V4 EL.A3TB 
   59.3    59.3    59.4    59.4    59.4    59.4    59.4    59.7    59.7 
DJ.A3V7 DJ.A3V8 EL.A3ZM J8.A3O1 ET.A39L H2.A3RI EM.A4FH DJ.A2Q5 ET.A25P 
   59.8    59.8    59.8    59.8    59.9    60.1    60.2    60.3    60.4 
BJ.A4O9 EM.A3FJ EL.A3T2 E3.A3DY EL.A3T6 EL.A3ZH E3.A3E1 EM.A1CS ET.A3DV 
   60.5    60.5    60.8    60.9    61.0    61.0    61.1    61.1    61.1 
E8.A414 ET.A39I FY.A3R7 H2.A26U IM.A3EB DJ.A13T DJ.A2PP EM.A4FF MK.A4N7 
   61.2    61.3    61.3    61.3    61.4    61.6    61.6    61.7    61.8 
EL.A3TA IM.A3U3 EL.A3ZO FK.A3SG DJ.A3V2 ET.A3BT EL.A3H7 EM.A22N FK.A3SD 
   61.8    62.0    62.1    62.3    62.5    62.6    62.8    62.8    62.9 
BJ.A3EZ BJ.A28V FK.A3SB H2.A2K9 DJ.A3VF FE.A238 EM.A3AK BJ.A3PU BJ.A0ZF 
   63.0    63.1    63.1    63.2    63.3    63.3    63.5    63.5    63.6 
EL.A3N2 EL.A4KG EL.A3H1 EM.A3AL EM.A3AN FY.A3NP DJ.A3V9 ET.A3DP EM.A2CT 
   63.7    63.8    63.8    64.1    64.2    64.7    64.9    65.0    65.3 
FE.A23A FE.A236 BJ.A2N7 ET.A3DP E8.A415 EM.A2CL FE.A234 KS.A41J BJ.A2N7 
   65.3    65.4    65.4    65.5    65.8    65.8    65.8    65.9    66.2 
FY.A3WA EM.A2CQ H2.A3RI EL.A3ZQ ET.A2MZ J8.A3YG EL.A3T1 ET.A25I E8.A418 
   66.3    66.4    66.4    66.5    67.8    68.1    68.2    68.9    69.0 
EM.A3O6 EM.A1CV FY.A3W9 L6.A4EU BJ.A290 ET.A3BQ DJ.A2Q3 DJ.A3UM BJ.A0Z3 
   69.1    69.2    69.4    69.4    69.4    69.7    69.8    69.9    70.4 
EM.A3O8 DJ.A1QH EL.A3MX ET.A40Q FY.A3I4 FK.A3SE ET.A4KQ FE.A231 DE.A69J 
   70.8    71.4    72.0    72.3    72.3    72.4    72.6    72.8    73.0 
DJ.A1QD BJ.A28R CE.A27D EM.A3AP DJ.A2Q7 EL.A3GY E8.A242 DE.A2OL DJ.A2Q6 
   73.4    73.5    73.5    73.5    73.6    73.6    74.6    75.0    75.5 
DJ.A1QM DJ.A2Q4 DJ.A3VB EL.A3T7 EM.A22L FY.A3NM DJ.A13O EM.A2OZ DJ.A1QN 
   75.8    76.6    76.6    77.2    78.9    79.1    82.9    84.0    85.7 
DJ.A1QF BJ.A2N8 H2.A2K9 EM.A3OA FE.A233 
   86.2    89.1    91.1    95.4   104.0 

Moreover, we know that this cancer has different affection depending on the gender, being women more susceptibles, so maybe we would like to compare samples regarding the gender. Samples with unknown (NA) gender will be removed in order to allow further analysis.

In conclusion, two filters are applied @ref(maskdg):

  • Remove all sample with a sample depth lower than 40.
  • Remove samples without known gender.
maskd <- sampledepth > 40 #sample depth filter
maskg <- !is.na(thca$gender) #gender filter
dim(thca)
[1] 20115   572
thca.filt <- thca[,maskd&maskg]
dge.filt <- dge[,maskd&maskg]
dim(thca.filt)
[1] 20115   467

Once the filters are applied, the dataset contains 20115 genes and 467 samples. This way, none of our samples has a low sample depth @ref(filtlibsizes).

Fig1.Library sizes in increasing order.

Fig1.Library sizes in increasing order.

Distribution of expression levels among samples

One way to normalize RNA-seq data is an adjustment to compare across features in a sample, this can be performed using count per million reads (CPM). The distribution of expression values per samle in terms of logarithmic CPM units is ploted @ref(distRawExp) separating by tumor and normal samples due to the large number of samples. A box plot of the expression values per samples is also performed @ref(distboxplot) in order to have another visual way to interpret the data and spot location differences.

Fig2. Non-parametric density distribution of expression profiles per sample.

Fig2. Non-parametric density distribution of expression profiles per sample.

Fig3. Box plot of the distribution expression values across samples

Fig3. Box plot of the distribution expression values across samples

From this plots, it is observes that all samples follow a similar pattern with a group of highly expressed genes and another group of lowly expressed genes, following the common pattern of expression. From the box plot we can observe that there are no samples that deviate from the average interquartile range. For all this reasons, we assume that we do not need to normalise among samples.

Distribution of expression levels among genes

In order to identify lowly expressed genes, the average expression per gene through all samples is calculated. The distribution of thos values across genes is represented @ref(exprdist).

Fig4. Distribution of average expression level per gene.

Fig4. Distribution of average expression level per gene.

The most part of the genes have at least a log2CPM greater than 0, and we want to remove those one lowly expressed.

Filtering of lowly-expressed genes

We can fiter lowly-expreesed genes following two criteria:

  • Filter out genes below a minimum average log2CPM throught the samples.

  • Filter out genes with fewer than a given number of sample meeting a minimum log2CPM.

First, we are going to filter out genes with the second approach, and visualize it with an histogram to know if we want also to perform the other filter.

#calculate cpm cutoff around all samples
cpmcutoff <- round(10/min(dge.filt$sample$lib.size/1e+06), digits = 1)
cpmcutoff
[1] 0.2
#select number of samples meeting that cutoff
nsamplescutoff <- min(table(thca.filt$type))
nsamplescutoff
[1] 53
mask <- rowSums(cpm(dge.filt) > cpmcutoff) >= nsamplescutoff
thca.filt2 <- thca.filt[mask, ]
dge.filt2 <- dge.filt[mask, ]
dim(thca.filt2)
[1] 15408   467
Fig5. Distribution of average expression level per gene with first filter.

Fig5. Distribution of average expression level per gene with first filter.

Now, we have a dataset of 15408 genes to analyzed.

In this histogram, we have all genes that have passed the cutoff (red bars). However, we observe that the first red columns are not very representative of the group, and as we have a lot of genes to perform the analysis, we are going toapply a second filter.

To remove them, we choose a cutoff of 3 log2 CPM unit as minimum value of expression and visualize the fitered dataset with another histogram.

mask <- avgexp > 3
dim(thca.filt2)
[1] 15408   467
thca.filt3 <- thca.filt[mask, ]
dim(thca.filt3)
[1] 9423  467
dge.filt3 <- dge.filt[mask, ]
dim(dge.filt3)
[1] 9423  467
Fig6. Distribution of average expression level per gene with second filter.

Fig6. Distribution of average expression level per gene with second filter.

After filtering the dataset with both methods, we have finally 9423 genes to perform a differential expression analysis.

Store un-normalized versions of the filtered expression data.

saveRDS(thca.filt3, file.path("./", "thca.filt.unnorm.rds"))
saveRDS(dge.filt3, file.path("./", "dge.filt.unnorm.rds"))

Normalization

We calculate now the normalization factors on the filtered expression data set.

dge.filt3 <- calcNormFactors(dge.filt3)
head(dge.filt3$samples$norm.factors)
[1] 0.8471913 0.9717390 0.9386724 1.0353601 1.0518953 0.9920039

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(thca.filt3)$logCPM <- cpm(dge.filt3, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Store normalized versions of the filtered expression data.

saveRDS(thca.filt3, file.path("./", "thca.filt.rds"))
saveRDS(dge.filt3, file.path("./", "dge.filt.rds"))

MA-plots

MA plots are used for detecting intensity dependent biases, by comparing two groups of the dataset,

We first examine a global MA-plot for normalized and non-normalized data to have a general idea of the possible dependencies.

< table of extent 0 >
Fig7. MA-plots of normalized and non-normalized data.

Fig7. MA-plots of normalized and non-normalized data.

Fig7. MA-plots of normalized and non-normalized data.

Fig7. MA-plots of normalized and non-normalized data.

In the non-normalized plot, we observe there are a lot of genes non differential expreesion (logFC ~ 0) and also several ones with a a low log2CPM which are removed in the normalized plot to avoid artifacts in the posterior analysis. In the normalized plot, we also observe a smother shape with less outliers.

We examine now the MA-plots of the normalized expression profiles for each tumor and normal samples sparately, in order to observe if there is any sample with a anomalous expression profile. We look first to the tumor samples in Figure @ref(fig:maPlotsTumor).

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

Fig8. MA-plots of the tumor samples.

We do not observe samples with major expression-level dependent biases. Let’s look now to the normal samples in Figure @ref(fig:maPlotsNormal).

Fig9. MA-plots of the normal samples.

Fig9. MA-plots of the normal samples.

Fig9. MA-plots of the normal samples.

Fig9. MA-plots of the normal samples.

From normal samples, we can observe there is several with an anomalous profile, so we should remove those samples in order not to have biases in our results

#select more samples from tumor and normal
maskbad <- substr(colnames(thca.filt3), 0, 12) %in% c("TCGA.BJ.A28X", "TGCA.EL.A3ZP", "TCGA.ET.A3DP","TCGA.KS.A41I")
dim(thca.filt3)
[1] 9423  467
dim(dge.filt3)
[1] 9423  467
thca.filt4 <- thca.filt3[, !maskbad]
dge.filt4 <- dge.filt3[, !maskbad]
dim(dge.filt4)
[1] 9423  462
dim(thca.filt4)
[1] 9423  462
saveRDS(thca.filt4, file.path("./", "thca.filt4.rds"))
saveRDS(dge.filt4, file.path("./", "dge.filt4.rds"))

We remove 4 patients with deviant normal samples, but since for one patient we have a tumor and a normal sample (TCGA.BJ.A28X), we end up with 5 samples less.

Once, we have done the normalization, we perform a MDS plot (which is a tailored PCA) in order to visualize possibles samples with distinctive features from the rest, which can idicate potential problem with samples.

plotMDS(dge.filt4, col = c("red", "blue")[as.integer(dge.filt4$samples$group)], cex = 0.7)
legend("topleft", c("normal", "tumor"), fill = c("red", "blue"), inset = 0.05, cex = 0.7)

We can observe two cluster, which indicate a difference between normal and tummor samples. We have to test if this cluster is due to the type tissue or because any batch that confound our interest outcome, in order to continue with further analyses.

Moreover, in normal samples zone we can also observe several tumor samples. If that cluster is because type sample, that can be explained becuse not all thyroid tumors have the same mutations, which produces different levels of severity in cancer patients.

Batch identification

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(colnames(thca.filt4), 6, 7)
table(tss)
tss
4C BJ CE DE DJ DO E3 E8 EL EM ET FE FK FY GE H2 IM J8 KS L6 MK 
 1 46  9  9 80  4  7 15 97 68 47 12  6 23  1  8  7  9  7  3  3 
center <- substr(colnames(thca.filt4), 27, 28)
table(center)
center
 07 
462 
plate <- substr(colnames(thca.filt4), 22, 25)
table(plate)
plate
A10U A13Y A14Y A16R A180 A18C A19O A206 A20F A21D A220 A22L A22U A23N A23W 
  20   16   17   20   19   23   55   34   26   25   25   42   14   61   19 
A250 A39I 
  43    3 
portionanalyte <- substr(colnames(thca.filt4), 18, 20)
table(portionanalyte)
portionanalyte
11R 12R 13R 21R 22R 31R 
396  35   1  26   2   2 
samplevial <- substr(colnames(thca.filt4), 14, 16)
table(samplevial)
samplevial
01A 01B 11A 11B 11C 
407   5  48   1   1 

From this information we can make the following observations:

  • All samples were sequenced at the same center (07)

  • In this dataset, we can find six different combinations of tissue type and vial, it can be a potential batch.

  • Samples were collected across different tissue source sites (TSS).

  • Different samples were sequenced with different plates, so it can be also a potential batch.

  • Most of the samples were sequenced using one portion and analyte combinations, but there is other 5 different portion and analyte combinations

From this data we can consider as possible surrogates of batch effect indicator the tssm, plate, portionanalytic and samplevial.

TSS We are going to use the TSS as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS.

table(data.frame(TYPE=thca.filt4$type, TSS=tss))
        TSS
TYPE     4C BJ CE DE DJ DO E3 E8 EL EM ET FE FK FY GE H2 IM J8 KS L6 MK
  normal  0  8  0  0  0  1  0  1 27  5  3  0  0  1  0  2  0  0  2  0  0
  tumor   1 38  9  9 80  3  7 14 70 63 44 12  6 22  1  6  7  9  5  3  3

Observe that normal tissues from most of TSS are under-represented with respect to the tumor tissues. If TSS is a source of expression variability, this under-representation in the normal samples may lead to a potential confounding effect.

We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure @ref(fig:sampleClustering).

logCPM <- cpm(dge.filt4, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(thca.filt4)
outcome <- paste(substr(colnames(thca.filt4), 9, 12), as.character(thca.filt4$type), sep="-")
names(outcome) <- colnames(thca.filt4)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(tss))), fill=sort(unique(batch)))
Figure S6: Hierarchical clustering of the samples.

Figure S6: Hierarchical clustering of the samples.

We can observe there is not evident clustering depending of the tss, so we can descard it as a possible batch.

In Figure @ref(fig:mdsPlot) we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples. We can also observe that two tumor samples, corresponding to individuals KL-8404 and KN-8427 are separated from the rest, just as it happens in the hierchical clustering. A closer examination of their corresponding MA-plots also reveals a slight dependence of expression changes on average expression. We may consider discarding these two samples and doing the MDS plot again to have a closer look to the differences among the rest of the samples and their relationship with TSS.

Plate

table(data.frame(TYPE=thca.filt4$type, plate=plate))
        plate
TYPE     A10U A13Y A14Y A16R A180 A18C A19O A206 A20F A21D A220 A22L A22U
  normal    0    5    0    1    3    4    2    0    5    4    4    8    2
  tumor    20   11   17   19   16   19   53   34   21   21   21   34   12
        plate
TYPE     A23N A23W A250 A39I
  normal   12    0    0    0
  tumor    49   19   43    3
logCPM <- cpm(dge.filt4, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(plate))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(thca.filt4)
outcome <- paste(substr(colnames(thca.filt4), 9, 12), as.character(thca.filt4$type), sep="-")
names(outcome) <- colnames(thca.filt4)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(plate))), fill=sort(unique(batch)))
Figure S8: Hierarchical clustering of the samples by Plate.

Figure S8: Hierarchical clustering of the samples by Plate.

PortionAnalyte

table(data.frame(TYPE=thca.filt4$type, PA=portionanalyte))
        PA
TYPE     11R 12R 13R 21R 22R 31R
  normal  40   7   1   2   0   0
  tumor  356  28   0  24   2   2
logCPM <- cpm(dge.filt4, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(thca.filt4)
outcome <- paste(substr(colnames(thca.filt4), 9, 12), as.character(thca.filt4$type), sep="-")
names(outcome) <- colnames(thca.filt4)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(portionanalyte))), fill=sort(unique(batch)))
Figure S10: Hierarchical clustering of the samples by Portion Analyuc.

Figure S10: Hierarchical clustering of the samples by Portion Analyuc.

Samplevial

table(data.frame(TYPE=thca.filt4$type, SV=samplevial))
        SV
TYPE     01A 01B 11A 11B 11C
  normal   0   0  48   1   1
  tumor  407   5   0   0   0
logCPM <- cpm(dge.filt4, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(samplevial))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(thca.filt4)
outcome <- paste(substr(colnames(thca.filt4), 9, 12), as.character(thca.filt4$type), sep="-")
names(outcome) <- colnames(thca.filt4)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(samplevial))), fill=sort(unique(batch)))
Figure S12: Hierarchical clustering of the samples by Portion Analyuc.

Figure S12: Hierarchical clustering of the samples by Portion Analyuc.

From this hierchical clustering plot, we can observe that there is a different cluster depending of the samplevial we use, so they can confounding with the interest outcome we want. For that reason, we are going to adjust or remove this batch in order to get a non-biased results.

plotMDS(dge.filt4, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(samplevial))),
       fill=sort(unique(batch)), inset=0.05)
Figure S11: Multidimensional scaling plot of the samples.

Figure S11: Multidimensional scaling plot of the samples.

REAL BATCH!!! ## Session information

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ca_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
 [5] LC_MONETARY=ca_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
 [7] LC_PAPER=ca_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.56.0         annotate_1.56.2           
 [3] XML_3.98-1.11              AnnotationDbi_1.40.0      
 [5] lattice_0.20-35            edgeR_3.20.9              
 [7] limma_3.34.9               SummarizedExperiment_1.8.1
 [9] DelayedArray_0.4.1         matrixStats_0.53.1        
[11] Biobase_2.38.0             GenomicRanges_1.30.3      
[13] GenomeInfoDb_1.14.0        IRanges_2.12.0            
[15] S4Vectors_0.16.0           BiocGenerics_0.24.0       
[17] knitr_1.20                

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16           RColorBrewer_1.1-2     compiler_3.4.4        
 [4] highr_0.6              XVector_0.18.0         bitops_1.0-6          
 [7] tools_3.4.4            zlibbioc_1.24.0        digest_0.6.15         
[10] bit_1.1-12             evaluate_0.10.1        RSQLite_2.1.0         
[13] memoise_1.1.0          Matrix_1.2-14          DBI_0.8               
[16] yaml_2.1.18            GenomeInfoDbData_1.0.0 stringr_1.3.0         
[19] locfit_1.5-9.1         rprojroot_1.3-2        bit64_0.9-7           
[22] grid_3.4.4             rmarkdown_1.9          blob_1.1.1            
[25] magrittr_1.5           backports_1.1.2        codetools_0.2-15      
[28] htmltools_0.3.6        xtable_1.8-2           KernSmooth_2.23-15    
[31] stringi_1.1.7          RCurl_1.95-4.10